CHAPTER 1.


INTRODUCTION 


The study of concurrent operations is an important part of evaluating parallel processing 
machines. Several analytical and simulation studies, such as those found in [1], [2], [3], and [4],
have been undertaken to evaluate such machines, but few if any investigate multiprocessor per- 
formance in a real workload environment. Such measurements are important for developing 
realistic techniques to measure and model concurrent behavior of system workloads. 

This project is concerned with the development of a measurement-based technique to study 
the use of loop-level concurrency in a real production workload. Measurements were performed 
on the Alliant FX/8 system at the Center for Supercomputing Research and Development 
(CSRD) at the University of Illinois at Urbana-Champaign. The FX/8 measured in the study is 
used primarily for development of numerical applications software. Programs developed on the 
machine range from high level software (FORTRAN), such as structural mechanics and circuit 
simulation, to assembly-level kernels for linear system solving [5], [6]. The Alliant is also
networked to several other department machines. 

The two main objectives of this work are 1) to find the percentage of concurrent operations 
in the workload and the use of processing resources in these operations and 2) to study the sys- 
tem overheads associated with concurrency in the workload environment, including the effect of 
concurrency on other system performance measures. The methodology developed is general and 
in principle can be applied to study other parallel systems. The thesis first proposes measures for 
characterizing concurrency in the system. Probability distributions of the values of these measures
show the extent of concurrent operations in the workload. Particular attention is paid to
the end of concurrent loops, and the corresponding overheads in these periods. Regression techniques
are used to assess the impact of increased system concurrency on other system measures,
including bus utilization and cache miss rate. 

Results from workload measurement show that the system workload is concurrent 35% of 
the time, and that concurrent periods typically use all available processors. Measurements of the 
end of concurrent operations indicate an uneven use of processors during these periods. Joint 
analysis of concurrency and system performance measures such as cache miss rate and processor 
bus activity shows that the probability of high values of these measures increases as concurrency
levels increase. Importantly, cache miss rate is shown to depend much more strongly on the fraction
of parallel code in the workload than on the number of processors active during concurrent
operations. In particular, a 100% increase in the fraction of concurrent operations in the workload
can result in an increase of more than 300% in cache miss rate.

The following chapter presents background information and related research in the mul- 
tiprocessor performance evaluation area. Chapter 3 describes the measurement environment of 
the study, including the Alliant FX/8 and the instrumentation used. Chapter 4 shows results of 
workload measurements, with special attention to periods of changes in concurrency. Chapter 5 
deals with relationships between system measures and concurrency, and Chapter 6 summarizes
results.

CHAPTER 2.

BACKGROUND AND MOTIVATION 

Concurrent operations in multiprocessing systems typically exist at three levels. At a high 
level are processes or tasks, either independent or related, which execute on separate processors. 
A low level of parallelism is the pipelined vector processing available on supercomputers [7]. 
Several simultaneous arithmetic operations are possible in these machines, on data contained in 
special vector registers. An intermediate level is loop concurrency, in which multiple iterations of 
a program loop are assigned to separate processors. A large effort has been made to apply this 
level of program concurrency to groups of processors; see, for example, [8] and [9]. This thesis is
concerned with evaluation of loop concurrency under real workload conditions. 

In loop concurrent operations, processors may proceed independently, if there is no depen- 
dency between loop iterations. The efficiency of the execution of this loop (which may include 
several more nested loops) is dependent on how the number of total iterations matches the 
number of available processors; for example, assignments which require some small number of 
processors to execute in parallel, while others remain idle (because there are no more iterations to 
execute) result in lower efficiency [8]. Performance is also dependent on contention between processors
for shared resources (e.g., cache, memory). Sets of loop iterations that contain dependencies
are further restricted in their execution. Dependencies may cause additional loop execution
overhead, since processors may have to wait on those executing previous iterations to satisfy the 
dependence [10]. 

2.1 Related Research

There have been many studies of parallel machine performance and concurrency. Most 
have employed simulation-based and analytical techniques; examples are found in [1], [2], [3], and
[4]. None of these studies has measured concurrency usage on a real workload of a multiprocessing
system. Such measurements are important for determining the sensitivity of system performance
to changes in the level of concurrency. Knowledge of these effects is essential for
developing performance evaluation methodologies. Furthermore, the results can be applied to 
control strategies, such as processor scheduling, within the multiprocessor. 

Two common measures of performance for a multiprocessing computer are Speedup and 
Efficiency. Speedup is defined as $S_P = T_1 / T_P$, where $T_1$ is the execution time required for a program
on a single processor, and $T_P$ is the execution time of the program on $P$ processors. Efficiency is
given by the ratio $E_P = S_P / P$, $0 < E_P < 1$ [11]. Speedup measures obtained from measurements
on the Alliant FX/8 are given in [12]. Speedup and Efficiency yield information about the 
improvement in a program, but they are unable to provide a detailed characterization of the pro- 
gram or system behavior. Importantly, when performance evaluation of a real production work- 
load is of interest, there is no direct applicability of the Speedup and Efficiency measures. 
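
As a small illustration of the two definitions, the following sketch computes Speedup and Efficiency from a pair of execution times. The function names and the timing values are hypothetical and are not measurements from the FX/8.

```python
def speedup(t_serial, t_parallel):
    """Speedup: single-processor time divided by P-processor time."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("execution times must be positive")
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, num_processors):
    """Efficiency: Speedup divided by the number of processors P."""
    if num_processors < 1:
        raise ValueError("num_processors must be at least 1")
    return speedup(t_serial, t_parallel) / num_processors


# Hypothetical timings in seconds, chosen only for illustration.
s = speedup(120.0, 20.0)        # 6.0
e = efficiency(120.0, 20.0, 8)  # 0.75
```

Both values describe a single program run in isolation, which is why they do not carry over directly to a mixed production workload.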

Other studies of multiprocessor systems, such as [13] and [14], have investigated more 
detailed aspects of machine performance. In [14] it is shown that shared resource contention, 
which typically grows with the number of processors present in a multiprocessor, can be a limit- 
ing performance factor. In [15], measurements on the FX/8 deal specifically with the effect of the 
machine’s memory hierarchy. 

Research more closely related to the work in this thesis is presented in [16] and [17]. These
studies use hardware monitoring and special event marker instructions embedded in programs to
acquire execution traces. Captured events on different processors are time-stamped, and the
composite trace yields information about the overlapping operations (concurrency) in the program.
Since this technique requires specific code insertion in programs, it is difficult to apply to
the observation of a real workload of programs generated by multiple users. 

None of the other measurement studies address the issues of evaluating the amount of con- 
currency in a workload, or relate this measured concurrency to the behavior of other system 
components, such as cache and main memory. This thesis studies three aspects of concurrency. 
The percentage of concurrency present in the workload is measured, along with the number of 
processors used during parallel operations. The end of loop-concurrent operations is studied to 
focus on overheads associated with concurrency. These operations are subject to performance 
degradations due to contention for shared resources, waiting associated with dependency resolu- 
tion, and less than full utilization of processing resources. The impact of concurrency on system 
performance measures is also analyzed to find the relationship between concurrency and other 
indices, including cache miss rate and processor bus activity. 

To investigate the issues described above, the Alliant FX/8 at CSRD was instrumented to 
extract data related to concurrency and other system performance measures. The following 
chapter includes a brief description of the FX/8 and its loop-concurrency mechanism, the instru- 
mentation, experiment setup, and the basic measurements made on the machine. 

CHAPTER 3.


EXPERIMENTAL ENVIRONMENT 


This chapter describes the Alliant FX/8, the instrumentation used for the measurements, 
and the experimental setup for the work. A more detailed description of the FX/8 may be found 
in Appendix C. 

3.1 System Description 

Measurements described here were performed on the Alliant FX/8 computer system. This 
machine is a shared-memory multiprocessor, with an advertised peak performance of 94.4 mil- 
lion floating point operations per second (MFLOPS) [18]. A diagram of the Alliant FX/8 
configuration used in the measurements is shown in Figure 1. 

Concurrency on the FX/8 is supported on the "Computational Cluster" (hereafter referred 
to as the Cluster) of eight Computing Elements (CEs). These processors have floating point and 
vector processing capabilities, and are linked through a common cache to shared memory. 
Interactive Processors (IPs) handle interactive traffic, operating system functions, and I/O. The 
machine supports an extension of 4.2 BSD UNIX, called Concentrix [18]. 

3.2 Alliant Concurrency 

This study is concerned with the measurement and evaluation of loop-level concurrency on 
the FX/8. The Alliant FORTRAN compiler attempts to transform DO loops or array operations 
into a parallel form, where the iterations of a loop will be executed on separate CEs. Figure 2 
shows how a concurrent loop is executed on the CE cluster. 

A program is executed serially until a special instruction is encountered which enables the
start of concurrent operation. Iterations of the DO loop are assigned to CEs in a self-scheduled 
fashion [19]. Processors are assigned iterations until all iterations are executed. As shown in the 
figure, the processor which executes the last iteration will continue serial execution after all itera- 
tions are complete, and need not be the same processor that entered the loop serially. The above 
loop execution may be complicated by dependencies which exist between iterations, where syn- 
chronization will be required between CEs to enforce correct program operation. 
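
The self-scheduled assignment described above can be sketched as a small simulation. This is only a software analogue under stated assumptions: on the FX/8 the scheduling is done in hardware, the iteration durations here are hypothetical, loop iterations are taken to be independent, and the CE that continues serially is taken to be the one whose iteration finishes last.

```python
import heapq


def self_schedule(durations, num_ces=8):
    """Simulate self-scheduled execution of one concurrent loop.

    durations[i] is the (hypothetical) execution time of iteration i.
    Each CE takes the next unassigned iteration as soon as it is free.
    Returns (assignments, last_ce): assignments is a list of
    (iteration, ce, start, end) tuples, and last_ce is the CE whose
    iteration finishes last.
    """
    # Heap of (time at which the CE becomes free, CE number).
    free_at = [(0.0, ce) for ce in range(num_ces)]
    heapq.heapify(free_at)
    assignments = []
    for iteration, duration in enumerate(durations):
        start, ce = heapq.heappop(free_at)
        end = start + duration
        assignments.append((iteration, ce, start, end))
        heapq.heappush(free_at, (end, ce))
    if not assignments:
        return assignments, None
    last = max(assignments, key=lambda a: a[3])
    return assignments, last[1]
```

With ten iterations, nine of unit length and a final one of length three, the last iteration lands on CE 1 and that CE finishes last, although the loop was notionally entered on CE 0.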

Both synchronization and processor scheduling functions are handled in hardware, and make
use of the Concurrency Control Bus shown in Figure 1 [18].

3.3 Instrumentation 

The Alliant FX/8 was instrumented at both the hardware and system software levels. 
Hardware measurements yielded information about CE concurrency and system bus activity, 
while software measurements consisted of counts of events logged by the operating system kernel. 
The two levels of measurements occurred simultaneously but independently. Minimal system 
overhead was incurred for measurements; the hardware monitoring is inherently non-intrusive, 
and statistics-gathering software was that which was normally running in Concentrix. A feature 
of this instrumentation approach was the fact that no modifications were required to the system 
in order to perform the measurements. When monitoring a real workload, instrumentation must 
normally gather data without relying on any special operations of the system or programs moni- 
tored, since modification of user and system programs for measurement purposes is not possible 
in many cases. 





The hardware monitoring was accomplished with a Tektronix DAS 9100 Series logic 
analyzer [20]. This instrument acquires the state of up to 80 signals (on the unit used), and 
stores this data in a 512-deep buffer memory. The DAS is fully controllable through an I/O port;
all experiments used this feature to control the instrument, as well as to transfer acquired buffers 
to files resident on the Alliant system. 

Probes from the DAS were connected to the FX/8 at three different logical points: 

1) Bus opcode was monitored for each CE, where the bus was that between the CE and the CE Cache, on the CE’s side of the crossbar switch. Bus opcode indicates what type of operation (read, write, idle, etc.) is occupying the bus.

2) The shared memory bus opcode was monitored, yielding information about interactions between memory and cache and between multiple caches.

3) The Concurrency Control Bus was also monitored, to determine whether a processor was active in concurrent operation, or not in a concurrent-active state.


As mentioned above, the software measurements were those normally collected by the Con- 
centrix kernel, made available by a program written internally at CSRD to extract the values. 
The operating system logs counts continuously for a variety of memory management, scheduling, 
and interrupt variables. In this study, the measurement extracted was page faults generated by
the CEs.

3.4 Experiment Setup

The measurements were controlled by UNIX C-Shell script programs executing on the 
FX/8, which controlled collection of both the hardware and software data. The programs have 
the ability to configure the DAS monitor, enable the monitor’s triggering, transfer the data from 
the instrument to a host system, and reduce the acquired data to appropriate event counts (e.g., 
number of cache read operations). Table 1 shows the reduced set of events derived from a moni- 
tor buffer. 
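
The data reduction step can be pictured with the sketch below, which turns a buffer of monitor records into the four kinds of event counts listed in Table 1. The record layout (an activity bit mask, one bus opcode per CE, and one memory bus opcode) is a hypothetical stand-in for the DAS buffer format, which is not specified here, and CE bus opcodes are simply pooled over all CE busses.

```python
from collections import Counter


def reduce_buffer(records, num_ces=8):
    """Reduce one monitor buffer to event counts.

    Each record is assumed (hypothetically) to be a tuple
    (active_mask, ce_opcodes, mem_opcode): bit k of active_mask is 1
    when CE k is concurrent-active, ce_opcodes holds one bus opcode
    per CE, and mem_opcode is the shared memory bus opcode.
    """
    active_count = Counter()  # j -> records with j processors active
    ce_active = Counter()     # k -> records with processor k active
    ce_opcode = Counter()     # opcode -> occurrences, pooled over CE busses
    mem_opcode = Counter()    # opcode -> records on the memory bus
    for mask, ce_ops, mem_op in records:
        active = [k for k in range(num_ces) if (mask >> k) & 1]
        active_count[len(active)] += 1
        ce_active.update(active)
        ce_opcode.update(ce_ops)
        mem_opcode[mem_op] += 1
    return active_count, ce_active, ce_opcode, mem_opcode
```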

In these experiments, the objective was to observe the CE Cluster; therefore we chose the IP 
as the computing resource for executing the measurement control software. The Concentrix sys- 
tem allows control over what type of computing resource (IP, CE, Cluster with 1 to 8 CEs, or 
don’t care) a program will run on, provided that resource has the correct capabilities for the pro- 
gram [21]. Using the IP kept measurement artifact at a minimum for our experiments. 

3.5 Measurements 

Two types of measurements were performed on the FX/8. The first used random sampling
of the system to acquire data from the real workload on the machine. Nine sessions of this type
were performed on seven different midweek days, when the machine is used most heavily. Each
session lasted between four and eight hours. Five snapshots of the system were taken and
grouped together in a five-minute interval. A real-time program was written to condense the
acquisition into event counts; the result was then written to disk. Software measurements were
taken simultaneously with the hardware measurements. These were recorded at the time that
the hardware sample was stored.

HARDWARE MEASUREMENT EVENT COUNTS

Name           Event
[illegible]    number of records with j processors active
[illegible]    number of records with processor j active
[illegible]    number of records with CE bus opcode = j
dmbopj         number of records with mem bus opcode = j

TABLE 1. Hardware Event Counts.

A second group of measurements was executed in order to extract system behavior when the 
system was executing with high concurrency. These experiments dealt with hardware measure- 
ments only. High concurrency operation was captured by triggering the hardware monitor at 
times when the FX/8 Cluster was operating in a concurrent mode. Two different trigger events 
were used. In ten of the experiment sessions, the monitor was triggered when all eight processors 
in the Cluster were active, while in five other experiments, the transition from eight processors 
active to a smaller number active was the trigger event. This latter condition was chosen particu- 
larly to try to determine the behavior of the Cluster during times when the level of concurrency 
in the machine was changing. 
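
The two trigger conditions can be expressed as simple predicates over a sequence of observed processor counts. The sketch below is a software analogue only: the actual triggering was done by the logic analyzer on the monitored signals, and the input sequence is hypothetical.

```python
def find_triggers(active_counts, num_ces=8):
    """Return the sample indices at which each trigger condition holds.

    active_counts is a (hypothetical) sequence of the number of active
    CEs observed on successive monitor cycles.
    """
    full = []     # all CEs active
    falling = []  # first sample after leaving the all-active state
    previous = None
    for i, count in enumerate(active_counts):
        if count == num_ces:
            full.append(i)
        elif previous == num_ces:
            falling.append(i)
        previous = count
    return full, falling
```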

Processing of the measured data was performed with the Statistical Analysis System (SAS) 
package on an IBM 4381. SAS provides a large set of data analysis procedures, including graphi- 
cal presentation, regression, clustering, and analysis of variance [22]. 

The next chapter describes the analysis performed on the above measurements. The first 
section presents a definition of concurrency measures, and results obtained from the random sam- 
pling of workload. A description of periods of transition of concurrency is then detailed. 

CHAPTER 4.

ANALYSIS OF MEASURED DATA 

The analysis of the measurements in this study is presented in this chapter. Measures
are defined which assist in characterizing concurrency in the workload, and distributions of these
measures for the acquired data are given. Analysis of periods of transitions in the level of system 
concurrency is also performed, in order to examine the overheads associated with these transi- 
tions. 

4.1 Concurrency Measures 

In order to quantify concurrency in the workload, certain measures are defined. The work- 
load is one choice of scope for the measures; they could easily be applied at other levels, such as a 
program or sub-program. 

We first define a measure $c_j$ as follows:

$$c_j = \mathrm{Prob}(\text{Number of Active Processors} = j) \tag{4.1}$$

We call $c_j$ j-concurrency. From $c_j$ we derive the Workload Concurrency, $C_w$, which is the probability
that there is any level of concurrency (2 or more processors operating in parallel) in the system:

$$C_w = \sum_{j=2}^{P} c_j \tag{4.2}$$

The above measures deal with the amount of concurrency in the workload; we now restrict our
attention to times when the system is operating concurrently.

$$c_{j|c} = \mathrm{Prob}(\text{Number of Active Processors} = j \mid \text{Number of Active Processors} > 1) \tag{4.3}$$

$c_{j|c}$ measures the j-concurrency value for only those times when concurrency exists in the system.
If all $c_j$ values from 2 to $P$ are 0, this value is undefined. From $c_{j|c}$ we calculate the Mean
Concurrency Level, $P_c$, which is the average number of processors operating concurrently when at
least 2 processors are active. $P_c$ measures utilization of computing resources during concurrency,
and may vary from 2 to $P$.

$$P_c = \sum_{j=2}^{P} j \, c_{j|c} \tag{4.4}$$


The above measures may be applied at any level of multiprocessing capability of a given 
machine. In our experiments, we apply the measures to the specific case of loop-level con- 
currency in the overall workload of the multiprocessor. 
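
The measures defined in equations 4.1 through 4.4 can be computed directly from a histogram of records by number of active processors. The sketch below is a minimal illustration; the function name and the example histogram used with it are hypothetical.

```python
def concurrency_measures(counts):
    """Compute c_j, C_w, c_{j|c}, and P_c from a histogram.

    counts[j] is the number of records observed with j processors
    active, for j = 0 .. P.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no records")
    c = [n / total for n in counts]               # equation (4.1)
    c_w = sum(c[2:])                              # equation (4.2)
    concurrent = sum(counts[2:])
    if concurrent == 0:
        return c, c_w, None, None                 # (4.3), (4.4) undefined
    c_given = {j: counts[j] / concurrent          # equation (4.3)
               for j in range(2, len(counts))}
    p_c = sum(j * p for j, p in c_given.items())  # equation (4.4)
    return c, c_w, c_given, p_c
```

For a hypothetical histogram of 50 idle, 20 serial, 5 two-active, and 25 eight-active records, this gives a Workload Concurrency of 0.30 and a Mean Concurrency Level of 7.0.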


4.2 Workload Sampling Results 

Figure 3 shows the distribution of the number of active processors over all the measurement 
sessions. This figure shows the dominant concurrency states of the system as well as the ratio of 
concurrency to serial activity. The high points on the distribution at eight, one, and zero proces- 
sors active show that the CE Cluster spends the majority of its time in one of three states: full 
concurrency, serial, or idle.¹ This analysis was performed on data from individual measurement
sessions and the sum of all sessions. Distributions of processor activity in individual sessions 
showed significant variation during different periods; examples for two sessions are shown in 
Appendix A. 


¹ Idle in this context is with respect to Concurrent-Mode operation. Detached processes (exclusively serial) may constitute a portion of these states.

From the processor distribution, the concurrency measures defined in equations 4.1-4.4 may
be calculated. Table 2 shows these values for the sum of random-sample sessions. The value of
$C_w$ shows that concurrency in the workload is at 35% for the full set of measurement sessions.
Eight processors are active 28% of the time in the overall workload, but when the system is concurrent,
the 8-active state predominates at 93% ($c_{8|c}$), resulting in a Mean Concurrency Level of
7.66.

NUMBER OF PROCESSORS    TOTAL (records)
8                       42318
7                         377
6                          82
5                         337
4                         234
3                         748
2                        1514
1                       23812
0                       82170

Figure 3. Number of Records with N Processors Active / All Sessions.

CONCURRENCY MEASURES - ALL SESSIONS

Measure    Value            Measure      Value
$c_2$      0.0100           $c_{2|c}$    0.0331
$c_3$      0.0049           $c_{3|c}$    0.0164
$c_4$      0.0015           $c_{4|c}$    0.0051
$c_5$      0.0022           $c_{5|c}$    0.0074
$c_6$      0.0005           $c_{6|c}$    0.0018
$c_7$      [not legible]    $c_{7|c}$    0.0083
$c_8$      0.2795           $c_{8|c}$    0.9278
$C_w$      0.3506           $P_c$        7.66

Table 2. Overall Concurrency Measures for All Sessions.

For each sample (five minute period), a distribution of number of active processors was generated,
and the corresponding concurrency measures calculated. Figures 4 and 5 show the
distribution of samples with particular values of Workload Concurrency and Mean Concurrency
Level, respectively. Some points are of note in the distributions. The first is the large percentage
of samples with Workload Concurrency measures at or near zero, indicating serial code
execution or idle states during these periods. Some concurrency in the workload exists for 55% 
of the samples. Note that this number is significantly different from $C_w$ found in the overall
average. The reason for this is a difference between five-minute and overall average behavior.
Note from Figure 4 that there are many samples with low but non-zero Workload Concurrency,
which do not contribute significantly to the value of $C_w$ for the total group of measurement sessions.
For samples with non-zero $C_w$, greater than 94% of samples have a Mean Concurrency
Level higher than 6.5. (Recall that for any Workload Concurrency value of zero, Mean Concurrency
Level is not calculated, since it is a measure of the system when concurrent.) Hence,
concurrency which does appear in the measured workload has a characteristically high utilization 
of the total available concurrency resource. 
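
The gap between the fraction of samples showing some concurrency and the pooled Workload Concurrency can be reproduced with a toy calculation. The sample values below are hypothetical and serve only to show how many low-concurrency samples raise the first figure without moving the second very much.

```python
def pooled_and_per_sample(samples):
    """samples: list of (concurrent_records, total_records) per interval.

    Returns (pooled Workload Concurrency over all records,
             fraction of samples with any concurrency at all).
    """
    total = sum(t for _, t in samples)
    pooled = sum(c for c, _ in samples) / total
    nonzero = sum(1 for c, _ in samples if c > 0) / len(samples)
    return pooled, nonzero


# Hypothetical data: four samples of 1000 records each.
samples = [(0, 1000), (10, 1000), (20, 1000), (900, 1000)]
pooled, nonzero = pooled_and_per_sample(samples)
# pooled is 930/4000 = 0.2325, while 3 of 4 samples show some concurrency.
```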

While concurrency level is most often high, there are some periods in which it is not at its maximum.
Clearly, overheads due to multiprocessing can contribute to this problem. Study of periods of 
change in concurrency can yield information about these overheads; this approach is described in 
the following section. 

4.3 Concurrency Transitions 

On the FX/8, transitions in concurrency typically happen at the end of a DO loop, when 
there are no remaining iterations to perform, and processors begin to become idle while waiting 
for serial execution to continue. These transitions affect the overall efficiency of parallel operations.
These idle periods correspond to a multiprocessing overhead; if the transition from P processors
to one (serial) is instantaneous, processors do not incur any idle time, and Mean Concurrency
Level, $P_c$, is maximum. To investigate the efficiency during transition periods, a
specific set of measurements was performed in which monitoring began when processor activity
changed from all processors active (full-concurrency) to a lower concurrency level.

Figure 6 is the distribution of active processors for the concurrency transition periods. The
transition of interest is between 8-concurrency and 1-concurrency; hence the distribution shows
the number of records for 7 through 2 processors active. The state with 2 processors active 
accounts for 52% of the transition states. Average transition behavior, therefore, has transitions 
between 7 and 2 processors active which occur significantly faster than the transition from 2 pro- 
cessors to serial operation. 

Figure 7 shows the distribution of activity by individual processor during transitions. Pro- 
cessors 7 and 0 appear to be active significantly more often than the other processors, 
corresponding to the 2-concurrency peak in Figure 6, while processors 2, 3, and 4 are 
significantly less active than the others. 

A simple reason for uneven distribution of processor activity is a loop count which is I = 
8*j + 2, for integer j. If loop iterations throughout are equal in execution time, this would result 
in two iterations remaining after 8*j iterations have been executed, which would then take full
iteration times to complete. A particular number of "leftover" iterations occurring dominantly in
the workload would seem unlikely, however. Second, processors executing a loop may not follow 
the same execution path, due to conditional branching which is iteration-dependent. The proces- 
sor executing the shorter path will obviously finish earlier, and be available for scheduling a new 
iteration while the longer iteration is still executing. Data access patterns are also likely to differ 
between iterations, which may result in differences in memory latency, if cache misses and/or 
page faults are generated by the processor. Such variation in latency causes processors to lead or 
lag one another. Dependencies in a program loop may also result in uneven distribution of 
activity among processors, as some processors are waiting more than others. Finally, contention 
for resources such as memory can contribute to uneven processor activity. If priority schemes
favor particular processors, the remaining processors will suffer greater delay, increasing the
probability that they will trail other processors in execution at the end of the loop.
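
The "leftover" iteration argument can be checked with a short simulation of how a loop drains. The sketch assumes self-scheduled, independent iterations with hypothetical durations and ignores contention and dependencies, so it illustrates only the first of the explanations above.

```python
def tail_profile(durations, num_ces=8):
    """Time spent at each concurrency level while a loop drains.

    Iterations are self-scheduled: the CE that frees up first takes the
    next iteration. Returns {k: time with exactly k CEs active} for the
    period after the last iteration has been handed out.
    """
    busy_until = [0.0] * num_ces
    last_start = 0.0
    for d in durations:
        ce = min(range(num_ces), key=lambda k: busy_until[k])
        last_start = busy_until[ce]
        busy_until[ce] += d
    # After last_start no new iterations begin, so CEs only drop out.
    ends = sorted(t for t in busy_until if t > last_start)
    profile = {}
    now = last_start
    for i, t in enumerate(ends):
        active = len(ends) - i
        if t > now:
            profile[active] = profile.get(active, 0.0) + (t - now)
        now = t
    return profile
```

With 18 equal iterations (8*2 + 2), the whole drain period is spent with exactly two CEs active, for one full iteration time. With eight iterations of which one is three times longer, the drain instead consists of one unit at full concurrency followed by two units with a single CE active.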

4.4 Discussion of Results

Concurrency measures presented in this chapter describe what fraction of the workload is
concurrent ($C_w$) and how many processors are used during concurrent operations ($P_c$). Random
sampling of the workload during nine different sessions shows that $C_w = 0.35$, indicating that
about one-third of the workload is devoted to concurrent operations. An overall Mean Concurrency
Level of 7.66 shows that parallel operations have a high utilization of the machine’s processors.

Transitions in concurrency, typically at the end of parallel loops, show uneven use of processors.
The 2-concurrent state is significantly more frequent than other "transition" states. In
particular, CEs 7 and 0 tend to show more activity than other processors during transition; as 
other processors begin to become idle, these two typically continue to execute. Several reasons 
for this are possible, including a high frequency of loop counts that result in two "leftover" itera- 
tions. Also, uneven distribution of waiting time among processors, due to priority assignment for 
shared resources and/or loop iteration dependency may result in differences in activity between 
processors. 

The variation in system concurrency affects other system components related to processor 
activity; these include system busses and cache memory. Measured data is analyzed to determine 
this relationship in the following chapter. 

CHAPTER 5.


CONCURRENCY AND SYSTEM MEASURES 


In this chapter the effect of system concurrency measures on key system performance measures
is described. Analysis was performed on the combination of random sampling and
high concurrency measurement periods. Three measures, CE Bus Busy, Cache Miss Rate
(Missrate), and Page Fault Rate were calculated from the acquired data. CE Bus Busy is the 
fraction of processor-to-cache bus cycles that are not idle in the measured interval. The value 
shown in the following analysis is the average value of this fraction over all eight busses, and is a 
measure of the total information flow between CEs and Cache. Missrate is the fraction of total 
bus cycles corresponding to cache misses. Page Fault Rate is the sum of user-mode and system- 
mode page faults generated by the CEs during the measurement interval. 
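
The first two derived measures can be sketched as follows. The opcode label "idle" and the way miss cycles are counted are placeholders, since the FX/8 bus opcode encoding is not given here; only the arithmetic follows the definitions above.

```python
def ce_bus_busy(per_bus_opcode_counts, idle_opcode="idle"):
    """Average, over CE busses, of the fraction of non-idle cycles.

    per_bus_opcode_counts: one dict per CE bus mapping an opcode label
    to the number of records seen with that opcode. The label "idle"
    is a placeholder, not the FX/8's actual opcode encoding.
    """
    fractions = []
    for counts in per_bus_opcode_counts:
        total = sum(counts.values())
        if total == 0:
            continue
        fractions.append(1.0 - counts.get(idle_opcode, 0) / total)
    return sum(fractions) / len(fractions) if fractions else None


def miss_rate(miss_cycles, total_cycles):
    """Fraction of bus cycles that correspond to cache misses."""
    return miss_cycles / total_cycles if total_cycles else None
```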

5.1 Cache Miss Rate 

Figures 8 and 9 show the scatter-plots of the cache miss rate, against both Workload Con- 
currency and Mean Concurrency Level. Similar plots for CE Bus Busy and Page Fault Rate are 
shown in Appendix B. An inspection of Figure 8 shows that the highest Missrate values occur at 
maximum Workload Concurrency. In addition, an increase in Workload Concurrency appears to
increase the probability of a high Missrate value. 

Figure 9 also shows some increasing probability of high Missrate as $P_c$ increases, although
the Missrate is relatively unchanged for $P_c > 7.0$.

The distributions for Missrate are plotted in Figures 10 and 11 for increasing values of $C_w$
and $P_c$. (See Appendix B for the distributions of CE Bus Busy and Page Fault Rate.) Note that
the median Missrate value for $0.4 < C_w \le 0.8$ is 0.009, and increases sharply to 0.023 for
$C_w > 0.8$. Median value of Missrate shows no increase between the middle and high ranges of $P_c$,
indicating less sensitivity to this measure than to $C_w$.

Figure 11 (a). Distribution of Miss Rate, $P_c \le 6.0$
Figure 11 (b). Distribution of Miss Rate, $6.0 < P_c \le 7.5$
Figure 11 (c). Distribution of Miss Rate, $P_c > 7.5$

While probability of a cache miss increases with Workload Concurrency, and to a lesser
extent, Mean Concurrency Level, a high value of $C_w$ or $P_c$ does not preclude a low Missrate
value. Several observations exist at maximum $C_w$ and low or zero Missrate. Periods of high
Workload Concurrency and/or Mean Concurrency Level may generate low cache miss rates for 
several reasons. If the concurrent portion of the workload has well-behaved data and code local- 
ity, Missrate will obviously be low. Since each CE has an internal instruction cache (see Appen- 
dix C), loops and other program constructs which are local and "fit" in this cache will not gen- 
erate successive requests to the shared cache for instruction fetch. A high degree of register-to- 
register operations (which may include 32-element vector operations) will reduce data traffic 
between CE and cache, and consequently the average number of cache misses. Data dependency 
within concurrent loops may also reduce cache traffic. Processors are not required to access 
memory while waiting for synchronization, since this mechanism uses the physically separate 
Concurrency Control Bus [18]. Data and instruction locality across processors also will lessen the 
overall impact on the cache of higher concurrency in the workload. Data which is fetched to the 
shared cache for one CE and is soon needed by one or more additional processors will not result 
in additional misses for these processors. 

In summary, the distributions in this section show a general increase in cache miss rate with 
an increased amount of parallel code, and little relation between Missrate and the number of pro- 
cessors active within concurrent operations. In the following section, regression models are 
developed to quantify relationships between system and concurrency measures. 
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5.2 Regression Models 

Regression models were developed to quantify the median behavior of system measures with
respect to concurrency measures. A median point is calculated with respect to C_w by finding
the median of the system measure for the set of points clustered around their closest Workload
Concurrency midpoint (0.0, 0.1, ..., 1.0). The resulting set of coordinate pairs is then used to
determine a model of the form system measure versus C_w. The same technique was used to
calculate the median system measure points for Mean Concurrency Level (midpoints = 2.0,
3.0, ..., 8.0), to develop models of system measure versus P_c.
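The clustering step described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `median_points`, the tie-breaking rule, and the sample values are all assumptions introduced here.

```python
from collections import defaultdict
from statistics import median


def median_points(observations, midpoints):
    """Return sorted (midpoint, median system measure) pairs.

    observations: iterable of (concurrency_measure, system_measure) pairs.
    midpoints: cluster centres, e.g. 0.0, 0.1, ..., 1.0 for C_w.
    """
    clusters = defaultdict(list)
    for x, y in observations:
        # Assign the observation to its closest midpoint. Ties go to the
        # first midpoint in the list (an arbitrary choice for this sketch).
        nearest = min(midpoints, key=lambda m: abs(x - m))
        clusters[nearest].append(y)
    # Midpoints with no observations are simply omitted.
    return sorted((m, median(ys)) for m, ys in clusters.items())


# Hypothetical sample data, not taken from the measurements.
cw_midpoints = [i / 10 for i in range(11)]
samples = [(0.02, 0.001), (0.04, 0.003), (0.48, 0.006),
           (0.52, 0.008), (0.51, 0.010), (0.97, 0.030)]
points = median_points(samples, cw_midpoints)
```

For Mean Concurrency Level the same function would be called with midpoints 2.0, 3.0, ..., 8.0.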

Regression techniques were used to generate a fit of the median values described above and
the corresponding concurrency measure midpoints. Second-order linear models were determined
to most accurately model the data. These models were of the form:

$$\text{System Measure} = \beta_1 C_w + \beta_2 C_w^2 + C \tag{5.1}$$

or

$$\text{System Measure} = \beta_1 P_c + \beta_2 P_c^2 + C \tag{5.2}$$

The regression model finds $\beta_1$, $\beta_2$, and $C$ such that the equation

$$SSE = \sum_{i} \left[ y_i - (C + \beta_1 x_i + \beta_2 x_i^2) \right]^2 \tag{5.3}$$

is minimized, where $(x_i, y_i)$ is a (concurrency measure, system measure) observation. One
commonly used measure of the accuracy of the model for predicting the data is given by $R^2$,
which indicates the amount of variability in the data predicted by the model¹ [23]. The results
of the regression modeling are shown in Tables 3 and 4.
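As an illustration of the fitting step, the sketch below minimizes the SSE of (5.3) for a second-order model by solving the normal equations directly, and reports R^2 as 1 - SSE/SST. It is a sketch under stated assumptions: the function name `fit_quadratic` and the sample points are hypothetical, and the study itself may have used a statistical package rather than code of this kind.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = c + b1*x + b2*x**2; returns (b1, b2, c, r2).

    Requires at least three distinct x values and non-constant ys.
    """
    # Power sums that make up the normal equations of the quadratic fit.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented 3x3 system for the unknowns (c, b1, b2).
    a = [[s[0], s[1], s[2], t[0]],
         [s[1], s[2], s[3], t[1]],
         [s[2], s[3], s[4], t[2]]]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    c, b1, b2 = (a[i][3] / a[i][i] for i in range(3))
    # SSE as in (5.3); R^2 is the fraction of variability accounted for.
    sse = sum((y - (c + b1 * x + b2 * x * x)) ** 2 for x, y in zip(xs, ys))
    mean_y = sum(ys) / len(ys)
    sst = sum((y - mean_y) ** 2 for y in ys)
    return b1, b2, c, 1.0 - sse / sst


# Hypothetical (midpoint, median measure) points; not measured data.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [0.003, 0.004, 0.007, 0.014, 0.025]
b1, b2, c, r2 = fit_quadratic(xs, ys)
```

The model is called linear because it is linear in the parameters, which is why ordinary least squares applies even though the fitted curve is a parabola.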


¹ R^2 values are categorized in [24] as: 0 = no relationship, 0.25 = moderately weak, 0.5 = moderate, 0.75 = moderately
strong, 1.0 = perfect.
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Regression Models: System Measure vs. C_w

                          Model Parameters
System Measure            β_1              β_2              C                R^2
Median Miss Rate          -3.30 x 10^-3    2.57 x 10^-2     2.62 x 10^-3     0.74
Median CE Bus Busy        [illegible]      [illegible]      2.47 x 10^-2     0.89
Median Page Fault Rate    [illegible]      -1.02 x 10^4     1.07 x 10^3      0.65

Table 3. Regression Models versus C_w


Regression Models: System Measure vs. P_c

                          Model Parameters
System Measure            β_1              β_2              C                R^2
Median Miss Rate          5.05 x 10^-3     -7.43 x 10^-4    1.86 x 10^-2     0.07
Median CE Bus Busy        [illegible]      -7.79 x 10^-3    [illegible]      0.66
Median Page Fault Rate    [illegible]      -5.28 x 10^2     -2.53 x 10^4     0.61

Table 4. Regression Models versus P_c


The plot of the Missrate versus Workload Concurrency model is shown in Figure 12. The model
predicts that an increase in C_w from 0.5 to 1.0 will be accompanied by a more than threefold
increase in Missrate, from 0.007 to 0.024. While the scatter plots showed that Missrate values vary
over a wide range at a given Workload Concurrency, the fact that the median value is increasing
shows that the probability of higher values grows with C_w.
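The predicted values can be checked by evaluating the second-order model with the Median Miss Rate coefficients printed in Table 3. The sketch below assumes that the three parameter columns of the table are, in order, the linear coefficient, the quadratic coefficient, and the constant; the function name is hypothetical.

```python
# Median Miss Rate versus C_w coefficients as printed in Table 3
# (assumed column order: linear term, quadratic term, constant).
B1, B2, C0 = -3.30e-3, 2.57e-2, 2.62e-3


def predicted_missrate(cw):
    """Evaluate the second-order model at Workload Concurrency cw."""
    return C0 + B1 * cw + B2 * cw ** 2


low = predicted_missrate(0.5)    # about 0.0074
high = predicted_missrate(1.0)   # about 0.0250
ratio = high / low               # about 3.4
```

With these rounded coefficients the model gives roughly 0.0074 and 0.0250, close to the 0.007 and 0.024 quoted in the text; the small difference at the upper end is presumably due to rounding of the printed coefficients. The ratio is above three in either case.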

A similar analysis was performed to estimate a relationship between CE Bus Busy and the
concurrency measures. The models for this measure versus C_w and P_c are plotted in Figures 13
and 14. The figures show that the activity on the CE busses generally increases with both
Workload Concurrency and Mean Concurrency Level. Both concurrency measures establish the
fraction of time that processors may be active; median bus activity then follows this fraction. As
expected, the model predicts an almost linear increase in bus activity with Workload Concurrency
(i.e., the fraction of parallel code in the workload). With respect to Mean Concurrency Level,
however, activity increases until P_c = 6.0, after which bus activity levels off around 0.30. The
results suggest that increased bus activity is more dependent on the percentage of parallel code in
the workload (given by C_w) than on the degree of concurrency within parallel operations.
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5.3 Discussion of Results 

The analysis in this chapter presents a model for Missrate versus Workload Concurrency
which predicts a sharp increase in Missrate, from 0.007 to 0.024, between C_w = 0.5 and C_w =
1.0. Little correlation between Missrate and P_c is seen. This means that Missrate is much more
sensitive to the fraction of parallel code in the workload than to the number of processors active
within parallel operations. The probable cause for this result is that the kinds of functions which
are suitable for parallel encoding, such as matrix and concurrent vector operations, are usually
much more data intensive than general serial code. This can result in higher data traffic for
parallel codes (see the regression models for CE Bus Busy in the previous section), and a greater
number of cache misses.

As mentioned in section 5.1, locality of data and code across processors lessens the impact
of additional processors within a parallel operation on cache misses. This is an explanation for
Mean Concurrency Level's lack of relationship with Missrate. Processors executing iterations of
concurrent loops will typically follow similar instruction execution paths, ensuring good code
locality. In addition, data access patterns between loop iterations will usually be related, lowering
any additive effects of growth in P_c.

CE Bus activity shows near-linear growth with increasing Workload Concurrency. With
respect to Mean Concurrency Level, bus activity increases until reaching a maximum range at
P_c = 6.0. The increase with C_w is explained as above; the inherent difference between concurrent
and serial code results in greater traffic levels as C_w grows. Relatively constant bus activity after
P_c = 6.0 is likely a reflection of a higher degree of dependence-related waiting in periods of
maximum concurrency (all processors active) in the workload; such waiting will reduce bus traffic.

The following chapter highlights the key results arising from this work and makes suggestions
for future research.
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CHAPTER 8. 


CONCLUSIONS 


This study used measurements on an Alliant FX/8 multiprocessor at the University of Illi- 
nois Center for Supercomputing Research and Development to evaluate concurrency found in a 
real workload on the machine. A systematic methodology was developed for characterizing the 
amount of concurrency present in the workload, and the effect of concurrency on system perfor- 
mance indices such as cache miss rate and bus activity. 

Two measures, Workload Concurrency and Mean Concurrency Level, were defined and then
measured. Random sampling of the workload showed that Workload Concurrency varies from 0
(serial) to 1.0, and has an overall average of 35% for all measurement sessions. Idle, serial, and
fully concurrent states dominate in the CE Cluster. Mean Concurrency Level, which measures
the number of active processors during a concurrent operation, normally shows a value close to
maximum concurrency, or P_c = 8.
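To make the two measures concrete, the sketch below computes them from counts of sampled records by number of active CEs. The definitions used are assumptions adopted for illustration only: Workload Concurrency is taken as the fraction of non-idle records with two or more CEs active, and Mean Concurrency Level as the mean number of active CEs over those concurrent records. The function name and the counts are hypothetical.

```python
def concurrency_measures(records_by_active):
    """Return (C_w, P_c) from a mapping {active CEs: number of records}.

    Assumed definitions (illustrative only):
      C_w = concurrent records / non-idle records
      P_c = mean number of active CEs over concurrent records
    Either value is None when its denominator is zero.
    """
    busy = sum(cnt for n, cnt in records_by_active.items() if n >= 1)
    concurrent = sum(cnt for n, cnt in records_by_active.items() if n >= 2)
    if busy == 0:
        return None, None
    cw = concurrent / busy
    if concurrent == 0:
        return cw, None
    weighted = sum(n * cnt for n, cnt in records_by_active.items() if n >= 2)
    return cw, weighted / concurrent


# Hypothetical counts chosen only for illustration: idle, serial, and
# 8-active records dominate, as the text describes qualitatively.
counts = {0: 500, 1: 650, 2: 40, 7: 10, 8: 300}
cw, pc = concurrency_measures(counts)
```

Under these assumptions a purely serial sample gives C_w = 0 with P_c undefined, and P_c always lies between 2 and 8 on an eight-CE cluster.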

Analysis of transition periods between 8-concurrency and lower concurrency levels showed 
that processor usage was uneven during these times; for the measured data, periods of 2- 
concurrency dominate the transition periods. Possible reasons for this include a large percentage 
of concurrent loops with 2 "leftover" iterations, uneven distribution of dependency waiting times, 
and unbalanced sharing of resources during concurrent operation, where one or both of these pro- 
cessors experiences greater delays than the remaining CEs. 

System measures, including cache miss rate and CE bus activity, were analyzed with respect
to concurrency measures to observe what relationships exist. It was shown that in general, the
higher the value of C_w or P_c, the higher the probability of an increase in the system measure. For
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cache miss rate, neither concurrency measure established a lower bound on the system
measure's value.

Second-order linear regression models were developed to find the relationship between C_w,
P_c, and median system measure behavior. A model for cache miss rate showed a reasonable fit
versus C_w, and predicted an increase in the median value of Missrate from 0.007 to 0.024 as
Workload Concurrency increased from 50% to 100%. Missrate showed low correlation with P_c,
yielding the result that Missrate is more strongly related to the fraction of parallel code in the
workload than to the level of concurrency within parallel operations. CE bus activity was also
modeled versus both concurrency measures, and was seen to increase with both C_w and P_c,
although less strongly in the highest range of Mean Concurrency Level.

The methodology and results presented here are useful for multiprocessor evaluation and 
optimization. In particular, understanding of machine characteristics in the presence of a real 
workload is important, since the complexity of parallel systems makes prediction of performance 
difficult. The techniques used here can be applied to other parallel processing systems, and be 
extended to other levels of concurrency and new performance indices. 

Similar studies are suggested, in order to obtain a wide range of representative practical 
results in concurrency evaluation. Future research in the measurement of concurrency should 
include evaluation of individual programs, to determine their behavior within the workload 
environment. Also, the relationship of concurrency and software-level parameters (such as those 
related to job scheduling) deserves attention. 
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Figure A.1. Number of Records with N Processors Active / Session 1.
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Figure A.3. Distribution of Samples by CE Bus Busy. 
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Figure A.4. Distribution of Samples by Miss Rate. 
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Figure A.5. Distribution of Samples by Page Fault Rate. 
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APPENDIX B. 


CONCURRENCY VS. SYSTEM MEASURE DATA 
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LEGEND: A = 1 OBS, B = 2 OBS, ETC.
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Figure B.2. CE Bus Busy vs. Mean Concurrency Level. 
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Figure B.3 (a). Distribution of CE Bus Busy, C_w <= 0.4
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APPENDIX C. 

ALLIANT FX/8 DESCRIPTION 

The Alliant FX/8 supports two types of processors, an Interactive Processor (IP) and a
Computational Element (CE). The machine may be configured with as few as 1 IP and 1 CE
(FX/1), or up to 12 IPs and 8 CEs (FX/8). The IP is based on a Motorola 68012 microprocessor.
512 Kbytes of local memory are available to the IP; it is also directly connected to a 32 Kbyte IP
cache (IPC), which is in turn connected to the system memory bus. IPs handle all system I/O
through a Multibus system. The IPs' function is support of interactive load, support of the
operating system, and control of I/O.

The CE has a base instruction set similar to the Motorola 68020 microprocessor. Addition- 
ally, the CE supports vector processing, floating point operation, and concurrent execution in the 
Computational Cluster, where loop-level multiprocessing takes place. Vector operations may 
occur simultaneously with multiprocessing on the Cluster. Each CE contains a 16 Kbyte instruc- 
tion cache for efficient handling of loops and other localized portions of code. The CEs share a 
four-way interleaved cache memory, with a total size of 128 Kbytes, divided into two Computa- 
tional Element Caches (CPCs). Connection to these cache modules is accomplished through a 
crossbar switch which routes both address and data between cache and CE. 

All data traffic between processors (CE or IP) and shared memory takes place through the 
processors’ respective caches. The caches maintain data coherency by requiring that a cache pos- 
sess a "unique" copy of data before modifying it. Traffic between caches and main memory is over 
two 64-bit wide data busses, with a total maximum bandwidth of 188 Mbytes per second. The 
main memory has an interleaving factor of four, and has a maximum size of 64 Mbytes. The 
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system's virtual address spaces are organized as 1024 segments of 1024 pages per segment; pages
are 4 Kbytes in length [18], [25].
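The segment and page figures above imply a 32-bit virtual address space, since 1024 x 1024 x 4 Kbytes = 2^32 bytes. The short sketch below works through that arithmetic and shows one way such an address could be decomposed. The field ordering (segment, then page, then byte offset, from most to least significant bits) and the function name are assumptions for illustration; the text does not give the hardware address format.

```python
SEGMENTS = 1024
PAGES_PER_SEGMENT = 1024
PAGE_BYTES = 4 * 1024

# Total virtual address space implied by the figures in the text.
address_space_bytes = SEGMENTS * PAGES_PER_SEGMENT * PAGE_BYTES  # 2**32


def split_address(vaddr):
    """Split a virtual address into (segment, page, offset).

    Assumes a hypothetical segment | page | offset packing,
    i.e. 10 + 10 + 12 bits from most to least significant.
    """
    offset = vaddr % PAGE_BYTES
    page = (vaddr // PAGE_BYTES) % PAGES_PER_SEGMENT
    segment = vaddr // (PAGE_BYTES * PAGES_PER_SEGMENT)
    return segment, page, offset


fields = split_address(0x00403005)  # segment 1, page 3, offset 5
```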

The operating system on the Alliant FX/8 is called Concentrix, and is an implementation of 
4.2 BSD UNIX, with extensions including support of multiple processors. Languages supported 
include C, F/X FORTRAN, and assembler. FORTRAN is the only high-level language which 
generates code using the Cluster concurrency feature; this function can also be accessed by the 
assembly language programmer. Programs may be specified to run on either the CE or the IP
(the latter only if floating point or vector processing is not required), or on the Cluster with a
particular number of processors [21].
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